Exchange Server 2010 : Troubleshooting Methodology

10/24/2010 4:05:21 PM

Troubleshooting is part art and part science. This section presents one possible method to troubleshoot issues, but it is not necessarily the only approach you can take. The goal of troubleshooting is to quickly resolve the issue and find the root cause whenever possible. A feedback loop should be in place for making changes to the environment or processes and prevent the issue from happening again. Figure 1 outlines the troubleshooting process.
Figure 1. The troubleshooting process

1. Define the Scope

It sounds obvious, but defining the scope is a critical step in the troubleshooting process. Without clearly scoping and defining the problem, it's easy to get sidetracked or not have everybody working to solve the same problem. Verifying and recording as much information about the problem as possible is very important. You must also record the initial state of the environment before changes take place and that information is lost forever.

2. Collect the Data

This step involves collecting data from sources such as event logs, log files, and performance information. If possible, record the steps to reproduce the problem. Sometimes it helps to capture data from ancillary servers. For example, when troubleshooting performance issues with Exchange, you can get a complete picture by coordinating data captured from a client and server at the same time. On both the client and server use perfmon, network tracing, memory dumps, and other tools, and record the exact time that the issue occurs. This way you can look at the problem from several different sources to get a complete picture.

3. Correlate the Data

With the problem statement and the collected data, make a list of potential causes for the problem. It is also important in this step to pay attention to the problem details and error codes.

One helpful tool with this step is a dependency map. An example of a dependency map is shown in Figure 2.

Figure 2. High-level dependency map

From the high-level example in Figure 3, you can pull out an element and create a dependency map for it. For example, Figure 17-5 expands the Exchange service by specifically creating a map for the Exchange Mailbox role. It is possible to continue this process and pick another element, such as direct attached storage, and create a dependency map for it.

Creating these dependency maps helps quickly determine dependencies and can help focus the areas to troubleshoot.

Figure 3. Low-level dependency map

4. Rank the Causes

The next step is to make a list of the possible causes, adding additional emphasis on characteristics such as ease of implementation of solution, probability of cause, risk, and so on. This step will help focus on the most likely, easiest, or lowest-risk test solutions.

5. Work the Solutions

When it's time to work the solutions, start with whichever solution ranked highest from the previous step. Be careful not to make multiple changes at once, and record the changes made and any observations on the change's impact. Continue working the solutions until the problem is resolved.

6. Return to Operating State

It is important to undo any changes made while troubleshooting from the previous step. Logging, tracing, and other tools may have an impact on server performance, so removing any unnecessary settings is important.

7. Feedback Loop

Document the root cause and any details learned from troubleshooting and resolving the issue. If the problem resurfaces in the future, it's possible that a different group of people may be troubleshooting the issue. Good documentation preserves institutional knowledge that would be otherwise lost.